Transportation Data Analysis

Chapter 1: Public Bicycles Sharing Systems

Alberto González Paje

2016-12-28

Introduction to this chapter

This chapter covers data analysis of public bike sharing systems around the world. It’s the first of a few chapters covering different transport means, such as airplanes, busese, cars and others. The idea of this project is to apply data analysis techniques to real world data sets. In this first chapter I will show a full data analysis cycle, starting in gathering data and ending by creating a prediction model of public bike sharing system capacity in London city.

This document is built using markdown syntax and then generated HTML code for publishing the whole on the Internet. The flow is basically splitted in three parts, gathering, storing and modeling data.

Get Public Bicicles Sharing Systems

The following code is coming from get_citybik.R In order to gather data, I will be using CityBikes API www.citybik.es where you can access in real time to data of dozens of public bike sharing systems around the world.

1.- Get list of all systems involved

First we need to set our current working directory. Beware of changing this working directory if you happen to be working with these data by your own.

Get a list of all networks using the network endpoint. Basically I want to know which bike systems are available using this endpoint.

# Get all networks involved
network_endpoint = "http://api.citybik.es/v2/networks"
network_endpoint_df <- jsonlite::fromJSON(network_endpoint, simplifyDataFrame = TRUE)

Get valid data from the list of all networks and incorporate a column with the suffix URL.

# Refine endpoint output
company_names = network_endpoint_df$networks
company_ids = company_names$id
company_data = company_names$location
company_url = company_names$href

# Build a suffix to endpoint
suffix = "http://api.citybik.es"
company_url = paste(suffix,company_names$href,sep="")
company_df = cbind(company_ids,company_data,company_url)

This is the resulting table:

A subset of company_df.

company_ids city country latitude longitude company_url
opole-bike Opole PL 50.66450 17.92760 http://api.citybik.es/v2/networks/opole-bike
wroclawski-rower-miejski Wrocław PL 51.10970 17.04850 http://api.citybik.es/v2/networks/wroclawski-rower-miejski
veturilo Warszawa PL 52.23450 21.00240 http://api.citybik.es/v2/networks/veturilo
bubi Budapest HU 47.49791 19.04023 http://api.citybik.es/v2/networks/bubi
kvb-rad Köln DE 50.93640 6.96053 http://api.citybik.es/v2/networks/kvb-rad
bemowo-bike Warszawa PL 52.26060 20.92900 http://api.citybik.es/v2/networks/bemowo-bike

Export a CSV file with the previous dataframe just in case we need this information at any time.

# Export list of companies
write.csv(company_df,file = "list_of_bikes_system_worldwide.csv",row.names = FALSE)

2.- Get data from all systems involved

Now I am going to create a function to get data from all our systems.

A subset of company_df.

company_ids city country latitude longitude company_url
opole-bike Opole PL 50.66450 17.92760 http://api.citybik.es/v2/networks/opole-bike
wroclawski-rower-miejski Wrocław PL 51.10970 17.04850 http://api.citybik.es/v2/networks/wroclawski-rower-miejski
veturilo Warszawa PL 52.23450 21.00240 http://api.citybik.es/v2/networks/veturilo
bubi Budapest HU 47.49791 19.04023 http://api.citybik.es/v2/networks/bubi
kvb-rad Köln DE 50.93640 6.96053 http://api.citybik.es/v2/networks/kvb-rad
bemowo-bike Warszawa PL 52.26060 20.92900 http://api.citybik.es/v2/networks/bemowo-bike

I will start with one system and then apply the same function to a set of cities, although you could do it for the >400 cities included in this magnificent API.

# 2.- For each system, get data (build DB and cron job)

# Build a function to read, parse and store endpoint data
bike_input = "http://api.citybik.es/v2/networks/wroclawski-rower-miejski"
# Read one system
bike_input_df = jsonlite::fromJSON(bike_input, simplifyDataFrame = TRUE)

bike_input_df_2 = bike_input_df$network
bike_input_df_3 = bike_input_df_2$stations

bike_df = bike_input_df_3[c(1,3:8)]
total_slots = bike_input_df_3$extra$slots
bike_df$total_slots = total_slots

The resulting table of capacity data in a given city:

A subset of bike_df.

empty_slots free_bikes id latitude longitude name timestamp total_slots
15 0 52515b944e4a64584def861922aa6db1 51.12528 16.98439 Legnicka - Wejherowska 2016-12-28T16:53:54.843000Z 15
15 0 dfe66f3e0cb8ac29b2386df9d68341d7 51.13208 17.06550 Pl. Kromera 2016-12-28T16:53:54.800000Z 15
13 0 0138e16d65d5db5b6e15970b73686b1a 51.10723 17.06133 Politechnika Wrocławska - Gmach Główny 2016-12-28T16:53:54.822000Z 14
15 0 7daa95e8b08a9d4b1401d5d7ed71e5c7 51.10098 17.00830 Pereca - Grabiszyńska 2016-12-28T16:53:54.832000Z 15
15 0 ae5d14252719f50d17084268b9906747 51.09804 17.00680 Żelazna - Pereca 2016-12-28T16:53:54.843000Z 15
15 0 d56c50f2935797bcffcba9f4237ded56 51.11943 17.05151 Wyszyńskiego - Prusa 2016-12-28T16:53:54.852000Z 15

You can see some of the cities included in the API.

# Get id + url
input_array = company_df[c(1,6)]

A subset of input_array.

company_ids company_url
opole-bike http://api.citybik.es/v2/networks/opole-bike
wroclawski-rower-miejski http://api.citybik.es/v2/networks/wroclawski-rower-miejski
veturilo http://api.citybik.es/v2/networks/veturilo
bubi http://api.citybik.es/v2/networks/bubi
kvb-rad http://api.citybik.es/v2/networks/kvb-rad
bemowo-bike http://api.citybik.es/v2/networks/bemowo-bike

I will build two arrays with key information, the city id and its API url, which are needed to build the final query.

A subset of array_id.

array_id
opole-bike
wroclawski-rower-miejski
veturilo
bubi
kvb-rad
bemowo-bike

A subset of array_url.

array_url
http://api.citybik.es/v2/networks/opole-bike
http://api.citybik.es/v2/networks/wroclawski-rower-miejski
http://api.citybik.es/v2/networks/veturilo
http://api.citybik.es/v2/networks/bubi
http://api.citybik.es/v2/networks/kvb-rad
http://api.citybik.es/v2/networks/bemowo-bike

As previously shown, let’s create a function to gather data from one city, and then apply that same function to more than one city.

# Create the function for one system
foo = function(url,id){
  input = as.character(url)
  read_input = jsonlite::fromJSON(input, simplifyDataFrame = TRUE)
  input_df_2 = read_input$network
  input_df_3 = input_df_2$stations
  input_df = input_df_3[c(1,3:8)]
  input_df$company_id = as.character(id)
  return (input_df)
}

Now it’s time to test the function in one city and then to a small amouny of cities and ultimately, if you wish, to the whole array of cities.

# Test fuction for one:
aa = array_url[2, ] 
aa
## [1] http://api.citybik.es/v2/networks/wroclawski-rower-miejski
## 457 Levels: http://api.citybik.es/v2/networks/adcb-bikeshare ...
a = foo(aa,"wroclawski-rower-miejski")

A subset of one system output.

empty_slots free_bikes id latitude longitude name timestamp company_id
15 0 52515b944e4a64584def861922aa6db1 51.12528 16.98439 Legnicka - Wejherowska 2016-12-28T16:56:58.738000Z wroclawski-rower-miejski
15 0 dfe66f3e0cb8ac29b2386df9d68341d7 51.13208 17.06550 Pl. Kromera 2016-12-28T16:56:58.703000Z wroclawski-rower-miejski
13 0 0138e16d65d5db5b6e15970b73686b1a 51.10723 17.06133 Politechnika Wrocławska - Gmach Główny 2016-12-28T16:56:58.726000Z wroclawski-rower-miejski
15 0 7daa95e8b08a9d4b1401d5d7ed71e5c7 51.10098 17.00830 Pereca - Grabiszyńska 2016-12-28T16:56:58.732000Z wroclawski-rower-miejski
15 0 ae5d14252719f50d17084268b9906747 51.09804 17.00680 Żelazna - Pereca 2016-12-28T16:56:58.736000Z wroclawski-rower-miejski
15 0 d56c50f2935797bcffcba9f4237ded56 51.11943 17.05151 Wyszyńskiego - Prusa 2016-12-28T16:56:58.751000Z wroclawski-rower-miejski

The next rows are meant to apply the foo (get data from one system) function to the 452 systems available in Citybikes.

# Apply function to all the bike systems

# u = array_url[1:452,]
# i = array_id[1:452,]
# 
# foo_all <- lapply(u,foo,i)

So the above chunk shows the techniques to play with the API. Now I will start to get real time data of eight european cities each hour and store that data in a SQLite Database.

3.- Get data from selected systems and store them in SQLite DB

Using the function created before, I test first one city.

# Test one city.

u = "http://api.citybik.es/v2/networks/dublinbikes"
i = "dublin"

city_ind = foo(u,i)

A subset of one city_ind.

empty_slots free_bikes id latitude longitude name timestamp company_id
30 5 b1c774f178240d5d757854364dafc840 53.34143 -6.246720 FENIAN STREET 2016-12-28T16:56:45.830000Z dublin
29 1 135d1f3410c5dbcc49c55a1d522dbbe6 53.34664 -6.246154 CITY QUAY 2016-12-28T16:56:45.844000Z dublin
34 6 153ff4dfb7bd8912ef91c10849129c2e 53.33521 -6.250900 FITZWILLIAM SQUARE EAST 2016-12-28T16:56:45.893000Z dublin
21 9 516ee41c56889c29b8fa5f4ac0e824f4 53.33901 -6.300217 BROOKFIELD ROAD 2016-12-28T16:56:45.815000Z dublin
27 13 6bff19daf1747a1e956a899ef8021232 53.34071 -6.308191 EMMET ROAD 2016-12-28T16:56:45.896000Z dublin
20 15 11b359924d83d802a41856c405d13e61 53.33878 -6.303950 ROTHE ABBEY 2016-12-28T16:56:45.891000Z dublin

Then I create (just the first time) the SQLite database nd store the results for future analysis.

Set up a cron job to get and store data for the selected list of systems with the following code (see store_citybik.R)

In this following code chunk you will see the case of one city (Dublin) but if you go to store_citybik.R you will see how I got data from eight different cities in the cron job.

## [1] 101   8

A subset of one city_ind.

empty_slots free_bikes id latitude longitude name timestamp company_id
30 5 b1c774f178240d5d757854364dafc840 53.34143 -6.246720 FENIAN STREET 2016-12-28T16:56:45.830000Z dublin
29 1 135d1f3410c5dbcc49c55a1d522dbbe6 53.34664 -6.246154 CITY QUAY 2016-12-28T16:56:45.844000Z dublin
34 6 153ff4dfb7bd8912ef91c10849129c2e 53.33521 -6.250900 FITZWILLIAM SQUARE EAST 2016-12-28T16:56:45.893000Z dublin
21 9 516ee41c56889c29b8fa5f4ac0e824f4 53.33901 -6.300217 BROOKFIELD ROAD 2016-12-28T16:56:45.815000Z dublin
27 13 6bff19daf1747a1e956a899ef8021232 53.34071 -6.308191 EMMET ROAD 2016-12-28T16:56:45.896000Z dublin
20 15 11b359924d83d802a41856c405d13e61 53.33878 -6.303950 ROTHE ABBEY 2016-12-28T16:56:45.891000Z dublin

Once you got data, you need to insert it into the database.

## [1] TRUE

4.- Read DB and Model

The final part of this process is coming from model_bikes.R I have been gathering data from those eight cities for a couple of months, therefore I have a rich dataset which in reality is a bunch of datasets that can be used to apply data analysis to any of those european cities.

4.1.- Read DB

The first thing to do is to connect to the database where I stored data and get it all.

# Open connection
con <- dbConnect(SQLite(),dbname="bikes.sqlite3")
my_db <- src_sqlite( "bikes.sqlite3", create = FALSE)

# Get all data
all_data = dbGetQuery(con,"SELECT * FROM bikes_records_2")
str(all_data)
## 'data.frame':    1075007 obs. of  8 variables:
##  $ empty_slots: int  15 3 8 27 36 34 30 17 40 39 ...
##  $ free_bikes : int  20 26 32 0 4 1 0 3 0 1 ...
##  $ id         : chr  "b1c774f178240d5d757854364dafc840" "135d1f3410c5dbcc49c55a1d522dbbe6" "153ff4dfb7bd8912ef91c10849129c2e" "516ee41c56889c29b8fa5f4ac0e824f4" ...
##  $ latitude   : num  53.3 53.3 53.3 53.3 53.3 ...
##  $ longitude  : num  -6.25 -6.25 -6.25 -6.3 -6.31 ...
##  $ name       : chr  "FENIAN STREET" "CITY QUAY" "FITZWILLIAM SQUARE EAST" "BROOKFIELD ROAD" ...
##  $ timestamp  : chr  "2016-10-17T12:47:15.653000Z" "2016-10-17T12:47:15.673000Z" "2016-10-17T12:47:15.755000Z" "2016-10-17T12:47:15.625000Z" ...
##  $ company_id : chr  "dublin" "dublin" "dublin" "dublin" ...

A subset of all_data.

empty_slots free_bikes id latitude longitude name timestamp company_id
15 20 b1c774f178240d5d757854364dafc840 53.34143 -6.246720 FENIAN STREET 2016-10-17T12:47:15.653000Z dublin
3 26 135d1f3410c5dbcc49c55a1d522dbbe6 53.34664 -6.246154 CITY QUAY 2016-10-17T12:47:15.673000Z dublin
8 32 153ff4dfb7bd8912ef91c10849129c2e 53.33521 -6.250900 FITZWILLIAM SQUARE EAST 2016-10-17T12:47:15.755000Z dublin
27 0 516ee41c56889c29b8fa5f4ac0e824f4 53.33901 -6.300217 BROOKFIELD ROAD 2016-10-17T12:47:15.625000Z dublin
36 4 6bff19daf1747a1e956a899ef8021232 53.34071 -6.308191 EMMET ROAD 2016-10-17T12:47:15.766000Z dublin
34 1 11b359924d83d802a41856c405d13e61 53.33878 -6.303950 ROTHE ABBEY 2016-10-17T12:47:15.744000Z dublin

Let’s apply some time transformations.

# Insert dates columns
#all_data$timestamp = as.Date(all_data$timestamp)
all_data$time_2 = ymd_hms(all_data$timestamp)
all_data$week_day = wday(all_data$time_2,label = TRUE)
all_data$hour = hour(all_data$time_2)
all_data$shift = cut(all_data$hour,c(0,7,15,23),c("night","morning","afternoon"))

A subset of all_data.

empty_slots free_bikes id latitude longitude name timestamp company_id time_2 week_day hour shift
15 20 b1c774f178240d5d757854364dafc840 53.34143 -6.246720 FENIAN STREET 2016-10-17T12:47:15.653000Z dublin 2016-10-17 12:47:15 Mon 12 morning
3 26 135d1f3410c5dbcc49c55a1d522dbbe6 53.34664 -6.246154 CITY QUAY 2016-10-17T12:47:15.673000Z dublin 2016-10-17 12:47:15 Mon 12 morning
8 32 153ff4dfb7bd8912ef91c10849129c2e 53.33521 -6.250900 FITZWILLIAM SQUARE EAST 2016-10-17T12:47:15.755000Z dublin 2016-10-17 12:47:15 Mon 12 morning
27 0 516ee41c56889c29b8fa5f4ac0e824f4 53.33901 -6.300217 BROOKFIELD ROAD 2016-10-17T12:47:15.625000Z dublin 2016-10-17 12:47:15 Mon 12 morning
36 4 6bff19daf1747a1e956a899ef8021232 53.34071 -6.308191 EMMET ROAD 2016-10-17T12:47:15.766000Z dublin 2016-10-17 12:47:15 Mon 12 morning
34 1 11b359924d83d802a41856c405d13e61 53.33878 -6.303950 ROTHE ABBEY 2016-10-17T12:47:15.744000Z dublin 2016-10-17 12:47:15 Mon 12 morning

We can define a capacity status in terms of how many bikes are being used in a given station. We define a station as overflow when there is few empty slots to park. As far as I have seen is other similar analysis, thresholds to score a station as overflow, balanced or shortage are subjective. In this analysis, stations with empty slots from 0% to 25% are scored as overflow. Stations with empty slots from 75% to 100% are scored as shortage and everything in between is balanced.

# Total slots ans shifts calculations
all_data$total_slots = all_data$empty_slots + all_data$free_bikes

all_data$empty_status = (all_data$empty_slots / all_data$total_slots) * 100
all_data$empty_status_label = cut(all_data$empty_status,c(0,25,75,100),c("overflow","balanced","shortage"))

all_data$empty_status_label[all_data$empty_status==0] <- "overflow"

A subset of all_data.

empty_slots free_bikes id latitude longitude name timestamp company_id time_2 week_day hour shift total_slots empty_status empty_status_label
15 20 b1c774f178240d5d757854364dafc840 53.34143 -6.246720 FENIAN STREET 2016-10-17T12:47:15.653000Z dublin 2016-10-17 12:47:15 Mon 12 morning 35 42.85714 balanced
3 26 135d1f3410c5dbcc49c55a1d522dbbe6 53.34664 -6.246154 CITY QUAY 2016-10-17T12:47:15.673000Z dublin 2016-10-17 12:47:15 Mon 12 morning 29 10.34483 overflow
8 32 153ff4dfb7bd8912ef91c10849129c2e 53.33521 -6.250900 FITZWILLIAM SQUARE EAST 2016-10-17T12:47:15.755000Z dublin 2016-10-17 12:47:15 Mon 12 morning 40 20.00000 overflow
27 0 516ee41c56889c29b8fa5f4ac0e824f4 53.33901 -6.300217 BROOKFIELD ROAD 2016-10-17T12:47:15.625000Z dublin 2016-10-17 12:47:15 Mon 12 morning 27 100.00000 shortage
36 4 6bff19daf1747a1e956a899ef8021232 53.34071 -6.308191 EMMET ROAD 2016-10-17T12:47:15.766000Z dublin 2016-10-17 12:47:15 Mon 12 morning 40 90.00000 shortage
34 1 11b359924d83d802a41856c405d13e61 53.33878 -6.303950 ROTHE ABBEY 2016-10-17T12:47:15.744000Z dublin 2016-10-17 12:47:15 Mon 12 morning 35 97.14286 shortage

Now it’s time to focus on a single city, in this case, London.

# One City
ci = "london" # line 38 in model bikes.R

ci_df = all_data %>%
  filter(company_id == ci)

dim(ci_df)
## [1] 274291     15
ci_df = na.omit(ci_df)
dim(ci_df)
## [1] 271950     15

A subset of ci_df.

empty_slots free_bikes id latitude longitude name timestamp company_id time_2 week_day hour shift total_slots empty_status empty_status_label
5 11 d79e95513c3b5e1d3c78a65874054481 51.49887 -0.1374246 200178 - Buckingham Gate, Westminster 2016-10-17T14:08:07.773000Z london 2016-10-17 14:08:07 Mon 14 morning 16 31.25000 balanced
20 0 090f297e79c030688a8a51755ebeafe3 51.50172 -0.1798542 001231 - Queen’s Gate, Kensington Gardens 2016-10-17T14:08:06.788000Z london 2016-10-17 14:08:06 Mon 14 morning 20 100.00000 shortage
9 9 7f3020118e56165ed8b2f61899edb971 51.52916 -0.1099705 001023 - River Street , Clerkenwell 2016-10-17T14:08:04.972000Z london 2016-10-17 14:08:04 Mon 14 morning 18 50.00000 balanced
13 22 67e6c16bce05410ba4b1f0f5000726ea 51.49961 -0.1975742 001018 - Phillimore Gardens, Kensington 2016-10-17T14:08:04.980000Z london 2016-10-17 14:08:04 Mon 14 morning 35 37.14286 balanced
15 8 d0ca154129878ce71261d563d8da9238 51.53006 -0.1209737 001013 - St. Chad’s Street, King’s Cross 2016-10-17T14:08:04.989000Z london 2016-10-17 14:08:04 Mon 14 morning 23 65.21739 balanced
7 19 8e74e96bcc8d74d7621f8201cee921dd 51.49313 -0.1568760 003420 - Sedding Street, Sloane Square 2016-10-17T14:08:04.993000Z london 2016-10-17 14:08:04 Mon 14 morning 26 26.92308 balanced

Mean capacity by weekday, hour and station.

A subset of ci_df_ref.

week_day hour name avg_used_capacity
Sun 8 000970 - Scala Street, Fitzrovia 100
Sun 8 000999 - Queen Street 1, Bank 100
Sun 8 001006 - Bayley Street , Bloomsbury 100
Sun 8 001016 - Hereford Road, Bayswater 100
Sun 8 001043 - Museum of London, Barbican 100
Sun 8 001048 - Queen Victoria Street, St. Paul’s 100

We can apply a rank technique to get X number of stations:

A subset of top_stations_weekday.

week_day hour name avg_used_capacity
Sun 8 000970 - Scala Street, Fitzrovia 100
Sun 8 000999 - Queen Street 1, Bank 100
Sun 8 001006 - Bayley Street , Bloomsbury 100
Sun 8 001016 - Hereford Road, Bayswater 100
Sun 8 001043 - Museum of London, Barbican 100
Sun 8 001048 - Queen Victoria Street, St. Paul’s 100

A subset of bottom_stations_weekday.

week_day hour name avg_used_capacity
85211 Fri 4 200086 - Ilchester Gardens, Bayswater 0
85212 Fri 4 200105 - Bonny Street, Camden Town 0
85213 Fri 4 200118 - Parkway, Camden Town 0
85214 Fri 4 200146 - Newby Place, Poplar 0
85215 Fri 4 200182 - Rainville Road, Hammersmith 0
85216 Fri 4 200211 - Union Grove, Wandsworth Road 0

Although it’s simple to get that top and bottom used stations grouping just by station:

See top and bottom used stations:

##                                          name avg_used_capacity
## 1          300079 - Manfred Road, East Putney          91.54527
## 2            200177 - Santos Road, Wandsworth          91.00370
## 3              300046 - Alma Road, Wandsworth          90.65369
## 4                  200174 - Mostyn Grove, Bow          89.82960
## 5       300026 - All Saints' Road, Portobello          89.63444
## 6         200063 - Mexfield Road, East Putney          89.46055
## 7  200220 - Blythe Road West, Shepherd's Bush          88.22927
## 8           200186 - Nantes Close, Wandsworth          88.12582
## 9            200219 - Osiers Road, Wandsworth          88.06595
## 10            200098 - Aintree Street, Fulham          87.16813
##                                       name avg_used_capacity
## 769   022183 - Crinan Street, King's Cross          26.13413
## 770  300234 - Stratford Station, Stratford          25.49022
## 771    001114 - Pont Street, Knightsbridge          25.47857
## 772      002695 - Cadogan Gardens, Chelsea          24.60526
## 773      200238 - Westferry DLR, Limehouse          24.47227
## 774          200050 - Naval Row, Blackwall          24.27386
## 775    200193 - Snowsfields, London Bridge          23.56908
## 776      300022 - Evesham Street, Avondale          21.96792
## 777          200118 - Parkway, Camden Town          21.89531
## 778 003420 - Sedding Street, Sloane Square          18.56592

If we want to have a better view, we can geocode those stations.

A subset of top_stations_geo.

name avg_used_capacity latitude longitude
200063 - Mexfield Road, East Putney 89.46055 51.45682 -0.2028021
200098 - Aintree Street, Fulham 87.16813 51.48102 -0.2099735
200174 - Mostyn Grove, Bow 89.82960 51.53054 -0.0254920
200177 - Santos Road, Wandsworth 91.00370 51.45706 -0.2008063
200186 - Nantes Close, Wandsworth 88.12582 51.46193 -0.1807918
200219 - Osiers Road, Wandsworth 88.06595 51.46067 -0.1987354

A subset of bottom_stations_geo.

name avg_used_capacity latitude longitude
001114 - Pont Street, Knightsbridge 25.47857 51.49689 -0.1612038
002695 - Cadogan Gardens, Chelsea 24.60526 51.49246 -0.1599190
003420 - Sedding Street, Sloane Square 18.56592 51.49313 -0.1568760
022183 - Crinan Street, King’s Cross 26.13413 51.53447 -0.1222030
200050 - Naval Row, Blackwall 24.27386 51.50898 -0.0069900
200118 - Parkway, Camden Town 21.89531 51.53807 -0.1446640

Now that we have geocoded top and bottom stations, we can plot them in a map.

## Source : https://maps.googleapis.com/maps/api/staticmap?center=London&zoom=12&size=640x640&scale=2&maptype=terrain
## Source : https://maps.googleapis.com/maps/api/geocode/json?address=London
## Source : http://tile.stamen.com/terrain-lines/12/2045/1360.png
## Source : http://tile.stamen.com/terrain-lines/12/2046/1360.png
## Source : http://tile.stamen.com/terrain-lines/12/2047/1360.png
## Source : http://tile.stamen.com/terrain-lines/12/2045/1361.png
## Source : http://tile.stamen.com/terrain-lines/12/2046/1361.png
## Source : http://tile.stamen.com/terrain-lines/12/2047/1361.png
## Source : http://tile.stamen.com/terrain-lines/12/2045/1362.png
## Source : http://tile.stamen.com/terrain-lines/12/2046/1362.png
## Source : http://tile.stamen.com/terrain-lines/12/2047/1362.png
## Source : http://tile.stamen.com/terrain-lines/12/2045/1363.png
## Source : http://tile.stamen.com/terrain-lines/12/2046/1363.png
## Source : http://tile.stamen.com/terrain-lines/12/2047/1363.png

## Warning: Removed 2 rows containing missing values (geom_point).

We can plot the top and bottom stations by week day.

## Warning: Removed 7 rows containing missing values (geom_point).

## Warning: Removed 12 rows containing missing values (geom_point).

A subset of ci_df_3.

name avg_used_capacity
300079 - Manfred Road, East Putney 91.54527
200177 - Santos Road, Wandsworth 91.00370
300046 - Alma Road, Wandsworth 90.65369
200174 - Mostyn Grove, Bow 89.82960
300026 - All Saints’ Road, Portobello 89.63444
200063 - Mexfield Road, East Putney 89.46055

A subset of ci_df_3_geo.

name avgcapacity latitude longitude
000959 - Milroy Walk, South Bank 33.42063 51.50724 -0.1062375
000960 - Hop Exchange, The Borough 56.37899 51.50463 -0.0917738
000961 - Union Street, The Borough 44.98475 51.50369 -0.0984977
000962 - Stamford Street, South Bank 52.78369 51.50557 -0.1116067
000963 - Bankside Mix, Bankside 44.83317 51.50582 -0.1001863
000964 - Bath Street, St. Luke’s 61.87755 51.52589 -0.0908478
## Warning: Removed 27 rows containing missing values (geom_point).

Adding up the empty slots of stations can give us a good proxy of total bikes being used.

##   hour bicis_circulando   city
## 1   10           430339 london
## 2    8           420704 london
## 3   11           403440 london
## 4   12           381153 london
## 5    9           379578 london
## 6   13           379430 london

A subset of circulando_df.

hour bicis_circulando city
10 430339 london
8 420704 london
11 403440 london
12 381153 london
9 379578 london
13 379430 london

And then we can plot it.

We can also incorporate population data.

## 'data.frame':    8 obs. of  1 variable:
##  $ unique(all_data$company_id): Factor w/ 8 levels "barcelona","budapest",..: 3 4 1 5 7 8 6 2
##        city population
## 1    dublin     553165
## 2  helsinki     630752
## 3 barcelona    1604555
## 4  istanbul   14025646
## 5    milano    1347707
## 6     paris    2249975

A subset of ci.

london

A subset of circulando_df.

city population
dublin 553165
helsinki 630752
barcelona 1604555
istanbul 14025646
milano 1347707
paris 2249975

And add a per_mil_capita metric.

A subset of cities_df_pop.

city hour bicis_circulando population popu_mil per_mil_capita
london 10 430339 8673713 8673.713 49.61416
london 8 420704 8673713 8673.713 48.50333
london 11 403440 8673713 8673.713 46.51295
london 12 381153 8673713 8673.713 43.94346
london 9 379578 8673713 8673.713 43.76188
london 13 379430 8673713 8673.713 43.74482

Let’s plot bikes per mil capita vs average used capacity.

Summary

A subset of cities_df_ave.

hour city bicis_circulando population popu_mil per_mil_capita ave_cap_h
1 london 33592 8673713 8673.713 3.872851 55.21850
2 london 33583 8673713 8673.713 3.871814 55.20689
3 london 33618 8673713 8673.713 3.875849 55.23499
4 london 33670 8673713 8673.713 3.881844 55.30567
5 london 34156 8673713 8673.713 3.937875 56.09910
6 london 47064 8673713 8673.713 5.426050 57.85625
7 london 270192 8673713 8673.713 31.150673 57.91823
8 london 420704 8673713 8673.713 48.503334 57.56097
9 london 379578 8673713 8673.713 43.761881 55.37769
10 london 430339 8673713 8673.713 49.614162 55.05346
11 london 403440 8673713 8673.713 46.512952 55.18604
12 london 381153 8673713 8673.713 43.943465 54.91695
13 london 379430 8673713 8673.713 43.744818 54.84079
14 london 300851 8673713 8673.713 34.685376 54.72947
15 london 289957 8673713 8673.713 33.429398 54.99840
16 london 151349 8673713 8673.713 17.449159 57.33982
17 london 105286 8673713 8673.713 12.138516 57.58917
18 london 68430 8673713 8673.713 7.889355 56.21981
19 london 34129 8673713 8673.713 3.934762 56.00120
20 london 33992 8673713 8673.713 3.918968 55.76749
21 london 33805 8673713 8673.713 3.897408 55.60826
22 london 44921 8673713 8673.713 5.178982 55.50966
23 london 33686 8673713 8673.713 3.883689 55.37428

A subset of top_stations_geo.

name avg_used_capacity latitude longitude
200063 - Mexfield Road, East Putney 89.46055 51.45682 -0.2028021
200098 - Aintree Street, Fulham 87.16813 51.48102 -0.2099735
200174 - Mostyn Grove, Bow 89.82960 51.53054 -0.0254920
200177 - Santos Road, Wandsworth 91.00370 51.45706 -0.2008063
200186 - Nantes Close, Wandsworth 88.12582 51.46193 -0.1807918
200219 - Osiers Road, Wandsworth 88.06595 51.46067 -0.1987354
200220 - Blythe Road West, Shepherd’s Bush 88.22927 51.49968 -0.2217916
300026 - All Saints’ Road, Portobello 89.63444 51.51904 -0.2047644
300046 - Alma Road, Wandsworth 90.65369 51.45799 -0.1848062
300079 - Manfred Road, East Putney 91.54527 51.45816 -0.2060025

A subset of bottom_stations_geo.

name avg_used_capacity latitude longitude
001114 - Pont Street, Knightsbridge 25.47857 51.49689 -0.1612038
002695 - Cadogan Gardens, Chelsea 24.60526 51.49246 -0.1599190
003420 - Sedding Street, Sloane Square 18.56592 51.49313 -0.1568760
022183 - Crinan Street, King’s Cross 26.13413 51.53447 -0.1222030
200050 - Naval Row, Blackwall 24.27386 51.50898 -0.0069900
200118 - Parkway, Camden Town 21.89531 51.53807 -0.1446640
200193 - Snowsfields, London Bridge 23.56908 51.50215 -0.0836329
200238 - Westferry DLR, Limehouse 24.47227 51.50930 -0.0259960
300022 - Evesham Street, Avondale 21.96792 51.51149 -0.2181902
300234 - Stratford Station, Stratford 25.49022 51.54179 -0.0048100

A subset of ci_df.

empty_slots free_bikes id latitude longitude name timestamp company_id time_2 week_day hour shift total_slots empty_status empty_status_label
5 11 d79e95513c3b5e1d3c78a65874054481 51.49887 -0.1374246 200178 - Buckingham Gate, Westminster 2016-10-17T14:08:07.773000Z london 2016-10-17 14:08:07 Mon 14 morning 16 31.25000 balanced
20 0 090f297e79c030688a8a51755ebeafe3 51.50172 -0.1798542 001231 - Queen’s Gate, Kensington Gardens 2016-10-17T14:08:06.788000Z london 2016-10-17 14:08:06 Mon 14 morning 20 100.00000 shortage
9 9 7f3020118e56165ed8b2f61899edb971 51.52916 -0.1099705 001023 - River Street , Clerkenwell 2016-10-17T14:08:04.972000Z london 2016-10-17 14:08:04 Mon 14 morning 18 50.00000 balanced
13 22 67e6c16bce05410ba4b1f0f5000726ea 51.49961 -0.1975742 001018 - Phillimore Gardens, Kensington 2016-10-17T14:08:04.980000Z london 2016-10-17 14:08:04 Mon 14 morning 35 37.14286 balanced
15 8 d0ca154129878ce71261d563d8da9238 51.53006 -0.1209737 001013 - St. Chad’s Street, King’s Cross 2016-10-17T14:08:04.989000Z london 2016-10-17 14:08:04 Mon 14 morning 23 65.21739 balanced
7 19 8e74e96bcc8d74d7621f8201cee921dd 51.49313 -0.1568760 003420 - Sedding Street, Sloane Square 2016-10-17T14:08:04.993000Z london 2016-10-17 14:08:04 Mon 14 morning 26 26.92308 balanced

Prediction Model

After a succint analysis, it’s time to create our prediction model. The goal is to predict a station status (overflow, balanced or shortage) in a given week day and hour.

## 'data.frame':    271950 obs. of  11 variables:
##  $ empty_slots       : int  5 20 9 13 15 7 6 16 13 3 ...
##  $ free_bikes        : int  11 0 9 22 8 19 12 0 5 21 ...
##  $ latitude          : num  51.5 51.5 51.5 51.5 51.5 ...
##  $ longitude         : num  -0.137 -0.18 -0.11 -0.198 -0.121 ...
##  $ hour              : int  14 14 14 14 14 14 14 14 14 14 ...
##  $ empty_status      : int  31 100 50 37 65 26 33 100 72 12 ...
##  $ empty_status_label: Factor w/ 3 levels "overflow","balanced",..: 2 3 2 2 2 2 2 3 2 1 ...
##  $ id                : chr  "d79e95513c3b5e1d3c78a65874054481" "090f297e79c030688a8a51755ebeafe3" "7f3020118e56165ed8b2f61899edb971" "67e6c16bce05410ba4b1f0f5000726ea" ...
##  $ name              : chr  "200178 - Buckingham Gate, Westminster" "001231 - Queen's Gate, Kensington Gardens" "001023 - River Street , Clerkenwell" "001018 - Phillimore Gardens, Kensington" ...
##  $ shift             : Factor w/ 3 levels "night","morning",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ week_day          : Ord.factor w/ 7 levels "Sun"<"Mon"<"Tues"<..: 2 2 2 2 2 2 2 2 2 2 ...
## [1] 271950     11

Now I will normalize data so it can fit in a prediction model.

## 
## overflow balanced shortage 
##    57173   126060    88717

A subset of df1_norm.

empty_slots free_bikes latitude longitude hour
0.0793651 0.1746032 0.4662308 0.4236568 0.5909091
0.3174603 0.0000000 0.4963503 0.2427163 0.5909091
0.1428571 0.1428571 0.7864481 0.5407341 0.5909091
0.2063492 0.3492063 0.4740658 0.1671494 0.5909091
0.2380952 0.1269841 0.7959171 0.4938113 0.5909091
0.1111111 0.3015873 0.4056110 0.3407064 0.5909091

Basically the model will classify a station depending on the number of free slots, number of available bikes, latitude, longitude and hour of the day. Split the dataset in two sets, 70% for training and 30% for testing.

## [1] 190365      5
## [1] 81585     5

Let’s train and test the model. I will use knn with k = 21.

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  81585 
## 
##  
##                 | df1_test_pred_2 
## df1_test_labels |  overflow |  balanced |  shortage | Row Total | 
## ----------------|-----------|-----------|-----------|-----------|
##        overflow |     16475 |       573 |         0 |     17048 | 
##                 | 47359.194 |  6830.008 |  5577.978 |           | 
##                 |     0.966 |     0.034 |     0.000 |     0.209 | 
##                 |     0.974 |     0.015 |     0.000 |           | 
##                 |     0.202 |     0.007 |     0.000 |           | 
## ----------------|-----------|-----------|-----------|-----------|
##        balanced |       444 |     36912 |       584 |     37940 | 
##                 |  7005.008 | 20993.024 | 11273.158 |           | 
##                 |     0.012 |     0.973 |     0.015 |     0.465 | 
##                 |     0.026 |     0.972 |     0.022 |           | 
##                 |     0.005 |     0.452 |     0.007 |           | 
## ----------------|-----------|-----------|-----------|-----------|
##        shortage |         0 |       487 |     26110 |     26597 | 
##                 |  5515.654 | 11424.166 | 34821.290 |           | 
##                 |     0.000 |     0.018 |     0.982 |     0.326 | 
##                 |     0.000 |     0.013 |     0.978 |           | 
##                 |     0.000 |     0.006 |     0.320 |           | 
## ----------------|-----------|-----------|-----------|-----------|
##    Column Total |     16919 |     37972 |     26694 |     81585 | 
##                 |     0.207 |     0.465 |     0.327 |           | 
## ----------------|-----------|-----------|-----------|-----------|
## 
## 

A subset of df1_test.

empty_slots free_bikes latitude longitude hour empty_status empty_status_label id name shift week_day label_pred
1 5 11 51.49887 -0.1374246 14 31 balanced d79e95513c3b5e1d3c78a65874054481 200178 - Buckingham Gate, Westminster morning Mon balanced
2 20 0 51.50172 -0.1798542 14 100 shortage 090f297e79c030688a8a51755ebeafe3 001231 - Queen’s Gate, Kensington Gardens morning Mon shortage
4 13 22 51.49961 -0.1975742 14 37 balanced 67e6c16bce05410ba4b1f0f5000726ea 001018 - Phillimore Gardens, Kensington morning Mon balanced
10 3 21 51.52395 -0.1225023 14 12 overflow 1eabd7ac8e781befd03f52ef56a18aa7 001022 - Brunswick Square, Bloomsbury morning Mon overflow
13 42 2 51.52994 -0.1236168 14 95 shortage 025ecd9d4f9708f515a36660227c8f3f 001011 - Belgrove Street , King’s Cross morning Mon shortage
16 21 9 51.52505 -0.1311611 14 70 balanced ebe16a8a0722ff26b583e2bda9cd1af4 001009 - Taviton Street, Bloomsbury morning Mon balanced

As you can see in the previous table, the model predicts quite well.

##           
##            overflow balanced shortage
##   overflow    16475      573        0
##   balanced      444    36912      584
##   shortage        0      487    26110

A subset of df1_test.

empty_slots free_bikes latitude longitude hour empty_status empty_status_label id name shift week_day label_pred pred
1 5 11 51.49887 -0.1374246 14 31 balanced d79e95513c3b5e1d3c78a65874054481 200178 - Buckingham Gate, Westminster morning Mon balanced 1
2 20 0 51.50172 -0.1798542 14 100 shortage 090f297e79c030688a8a51755ebeafe3 001231 - Queen’s Gate, Kensington Gardens morning Mon shortage 1
4 13 22 51.49961 -0.1975742 14 37 balanced 67e6c16bce05410ba4b1f0f5000726ea 001018 - Phillimore Gardens, Kensington morning Mon balanced 1
10 3 21 51.52395 -0.1225023 14 12 overflow 1eabd7ac8e781befd03f52ef56a18aa7 001022 - Brunswick Square, Bloomsbury morning Mon overflow 1
13 42 2 51.52994 -0.1236168 14 95 shortage 025ecd9d4f9708f515a36660227c8f3f 001011 - Belgrove Street , King’s Cross morning Mon shortage 1
16 21 9 51.52505 -0.1311611 14 70 balanced ebe16a8a0722ff26b583e2bda9cd1af4 001009 - Taviton Street, Bloomsbury morning Mon balanced 1
## [1] 2088   13

A subset of df1_test_wrong.

empty_slots free_bikes latitude longitude hour empty_status empty_status_label id name shift week_day label_pred pred
34 14 5 51.52334 -0.1838464 14 73 balanced ee5b657420a4bbb92081e8a2b7a401fc 001027 - Warwick Avenue Station, Maida Vale morning Mon shortage 0
278 10 3 51.52289 -0.1716820 14 76 shortage 8b1f02423f3951fd6c337014465d7f7f 010625 - Penfold Street, Marylebone morning Mon balanced 0
279 4 12 51.51689 -0.1582499 14 25 overflow 23ee76ce7a5418e315a92a4f79ea12a2 010630 - George Place Mews, Marylebone morning Mon balanced 0
742 4 12 51.50501 -0.1727296 14 25 overflow 5fb0ac0fdd9f54ba0b041d8ebba61c58 001217 - Serpentine Car Park, Hyde Park morning Mon balanced 0
792 6 19 51.52166 -0.1090063 14 24 overflow fbd79d7e9e1996ed0e9955ed4f31b33e 001008 - Hatton Wall, Holborn morning Mon balanced 0
852 9 24 51.51159 -0.0771213 14 27 balanced 6be38ccc37cd0c70d4dabccf654dc86f 000991 - Crosswall, Tower morning Mon overflow 0

Finally we can create a simple tool to search a given station and see its real and predicted status in a given day and hour.

## [1] 778

A subset of places.

d79e95513c3b5e1d3c78a65874054481
090f297e79c030688a8a51755ebeafe3
67e6c16bce05410ba4b1f0f5000726ea
1eabd7ac8e781befd03f52ef56a18aa7
025ecd9d4f9708f515a36660227c8f3f
ebe16a8a0722ff26b583e2bda9cd1af4
3704deafdc9a47f1a3b445511be1477b
43a7f48d18b5950311a758a86869c23b
0c01f449802222cdd504c8eafa0322a4
ad43baa18c9dbea9ba4b0966c386cc0c

A subset of places_names.

200178 - Buckingham Gate, Westminster
001231 - Queen’s Gate, Kensington Gardens
001018 - Phillimore Gardens, Kensington
001022 - Brunswick Square, Bloomsbury
001011 - Belgrove Street , King’s Cross
001009 - Taviton Street, Bloomsbury
003425 - Northington Street , Holborn
000981 - British Museum, Bloomsbury
001019 - Ampton Street , Clerkenwell
001029 - Windsor Terrace, Hoxton
##           
##            overflow balanced shortage
##   overflow        0        0        0
##   balanced        0        1        0
##   shortage        0        0        0

A subset of selected_place.

empty_slots free_bikes latitude longitude hour empty_status empty_status_label id name shift week_day label_pred pred
6 12 51.51812 -0.1442289 11 33 balanced 7c43ac2c8fefa2a6e3e0e6b57654e7cf 003424 - Broadcasting House, Marylebone morning Mon balanced 1